[WIP] make tracing a task public so self-tracing is possible #6972

arielb1 · 2024-11-13T16:31:52Z

There is some desire to make it possible for tasks to trace themselves to discover slow wakeups, something similar to the test I added.

This PR makes some functions public (but unstable) to make that easier.

I currently shared it with someone I'm working with to see whether it helps them debug their "slow task" problems. I'll make this PR less WIP after I get feedback from them.

WIP: actually add docs.

arielb1 · 2024-11-13T17:07:20Z

cc @carllerche

also for seeing how to make this pass CI without removing the test

arielb1 · 2024-11-18T11:49:42Z

[PR status: I'm having some people check out this PR to see if it helps them debug some "slow task" problems. After I see the lessons learned I'll make it less WIP]

jlizen · 2024-11-21T01:14:42Z

Chiming in, @arielb1 asked me to poke at this implementation in a somewhat realistic scenario where it might be useful.

One such case that I've encountered is while operating a service that uses a Redis cluster as a backend for distributed precision/adaptive throttling, with various multi-step state operations and varying key cardinality. Redis's single threaded event loop and key space partitioning makes it prone to bottlenecks if code is poorly optimized. Debugging performance issues can be complex and require a decent amount of context on Redis's underlying behaviors. Since, Redis-perspective per-command logs typically don't include time spent while waiting in the event loop queue, metrics tend to be heavily aggregated, and performance issues are difficult to directly trigger without large-scale load testing (and with realistic traffic shapes).

A bit of magic to systematically trace the slower futures without a lot of manual instrumentation by end users, and without introducing new bottlenecks due to blanket log/metric outputs, would be quite handy. This change would open up that door, and then other libraries could wrap their futures with self-tracing logic.

I threw together a crude simulation where I ran a redis cluster locally (8 nodes, key space evenly distributed). I then simulated a big (and fairly hot) key alongside a bunch of normal keys. My wrapper future takes a trace + dumps that trace as well as a human-readable key name, in case the duration of the total lookup is > 500ms.

This specific implementation is pretty heavy-handed for the use case above compared to a plain timer + simple event output. But, I can imagine end user scenarios where there are more complex futures that contain multiple i/o calls, different function contexts, etc, where the task trace might be handy. Ariel has another open PR that will also probably make access to the backtrace more useful.

Probably the larger benefit is for libraries rather than usage directly by end users. It might be nice for @arielb1 to look at how this API feels in e.g. a tower layer, which seems like a good use case for this.

My code and output are below. You'll see a subset of my bigkey calls tripping the threshold, as well as certain regular keys that are routed to the same node and stuck behind the bigkeys in line.

Here is my code:

use std::{future::Future, pin::Pin, task::{Context, Poll}, time::{Duration, Instant}};

use itertools::Itertools;
use rand::distributions::{Alphanumeric, DistString};
use redis::{cluster::ClusterClient, cluster_async::ClusterConnection, AsyncCommands};
use tokio::{runtime::dump::{Root, Trace}, task::JoinSet};
use tracing::{error, info_span, warn, Instrument};
use tracing_subscriber;

#[tokio::main]
async fn main() {
    tracing_subscriber::fmt::init(); 
    let client = ClusterClient::new(vec!["redis://127.0.0.1:6379"]).unwrap();
    let connection = client.get_async_connection().await.unwrap();

    let elements: Vec<(&'static str, &'static str)> = get_good_keys().into_iter().interleave(get_bad_keys().into_iter()).collect();

    let mut set = JoinSet::new();
    for (key, value) in elements.iter() {
        set.spawn(maybe_traced_redis_set(key, value, connection.clone()));
    }

    set.join_all().await;
}


async fn maybe_traced_redis_set(key: &'static str, value: &'static str, mut connection: ClusterConnection) {
    let span = info_span!("redis_set", key = key);
    let future = connection.set::<&str, &str, String>(key, value);
    let maybe_traced_future = SelfTracingFuture::new(future).instrument(span);

    if let Err(e) = maybe_traced_future.await {
        error!("error setting {key}: {e}");
    }
}


fn get_good_keys() -> Vec<(&'static str, &'static str)> {
    (0..200).map(|el| {
        let leaked: &'static str = el.to_string().leak();
        (leaked, leaked)
    }).collect()
}

fn get_bad_keys() -> Vec<(&'static str, &'static str)> {
    let big_key: &'static str = Alphanumeric.sample_string(&mut rand::thread_rng(), 5_000_000).leak();

    vec![("big_key", big_key); 50]
}


pin_project_lite::pin_project! { pub struct SelfTracingFuture<F: Future> {
        #[pin]
        f: Root<F>,
        t_last: State,
    }
}

enum State {
    NotStarted,
    Running { since: Instant },
    Alerted,
}


impl<F: Future> SelfTracingFuture<F> {
    fn new(f: F) -> Self {
        Self {
            f: Trace::root(f),
            t_last: State::NotStarted,
        }
    }
}


impl<F: Future> Future for SelfTracingFuture<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        let mut this = self.project();
        let now = Instant::now();
        let t_last = match this.t_last {
            State::Running { since } => Some(*since),
            State::NotStarted => {
                *this.t_last = State::Running { since: now };
                None
            }
            State::Alerted => {
                // don't double-alert for the same future
                None
            }
        };
        if t_last.is_some_and(|t_last| now.duration_since(t_last) > Duration::from_millis(500)) {
            let (res, trace) = tokio::runtime::dump::Trace::capture(|| this.f.as_mut().poll(cx));
            *this.t_last = State::Alerted;
            warn!("future ran long: {trace}");
            
            return res;
        }
        this.f.poll(cx)
    }
}

And here is my output:

     Running `build/private/cargo-target/debug/test_trace`
2024-11-21T02:20:21.795154Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.985727Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.985690Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.985997Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986514Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986774Z  WARN redis_set{key="191"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986933Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986032Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986290Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986339Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.985800Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986253Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986078Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.987240Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.986810Z  WARN redis_set{key="big_key"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
2024-11-21T02:20:21.988358Z  WARN redis_set{key="70"}: test_trace: future ran long: ╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
  └╼ redis::commands::AsyncCommands::set::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/commands/macros.rs:169:69
     └╼ redis::cmd::Cmd::query_async::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cmd.rs:434:48
        └╼ <core::pin::Pin<P> as core::future::future::Future>::poll at /rustc/f6e511eec7342f59a25f7c0534f1dbea00d01b14/library/core/src/future/future.rs:123:9
           └╼ redis::cluster_async::ClusterConnection<C>::route_command::{{closure}} at /workplace/jlizen/GokuBotsAtpAggregator/src/GokuRustCommons/build/private/cargo-home/registry/src/-6353be53fb99b376/redis-0.25.4/src/cluster_async/mod.rs:107:14
              └╼ <tokio::sync::oneshot::Receiver<T> as core::future::future::Future>::poll at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1112:30
                 └╼ tokio::sync::oneshot::Inner<T>::poll_recv at /workplace/jlizen/GokuBotsAtpAggregator/src/tokio/tokio/src/sync/oneshot.rs:1143:16
<snip, maybe half the bigkey calls were >500 ms and a scattering of other calls>

And then for completeness, here you can see the overlap of the bigkey + regular key futures that were running long - all were routed to node 4:

bash-5.2# redis-cli -c
127.0.0.1:6379> cluster keyslot big_key
(integer) 11045
127.0.0.1:6379> cluster keyslot 191
(integer) 11242
127.0.0.1:6379> cluster keyslot 70
(integer) 10833
127.0.0.1:6379> cluster nodes
8fd9761e7124eb3572428f6955a0d666ad804751 240.10.3.4:7001@17001 master - 0 1732152536000 7 connected 4096-6143
91490ff14cf53f07b17044a480162072bf3f6a3e 240.10.3.4:6379@16379 myself,master - 0 1732152535000 1 connected 0-2047
aef5d6bebe5d2a05cb1932657f7352bfc3e913ba 240.10.3.4:7000@17000 master - 0 1732152536782 0 connected 2048-4095
5d55504cde8d478a3006b14ead7097676d768546 240.10.3.4:7004@17004 master - 0 1732152535000 4 connected 10240-12287
04fcec03a691deb0d6dc65d83e1b4e647f5a113f 240.10.3.4:7006@17006 master - 0 1732152535000 5 connected 14336-16383
014709135cc1968a32a0f97d729a3a709d333eb1 240.10.3.4:7002@17002 master - 0 1732152537833 3 connected 6144-8191
6da488511ca9434ddd0c88f026c37741b059648a 240.10.3.4:7003@17003 master - 0 1732152535000 6 connected 8192-10239
218d002c72b76b9255056d990f09e11d82756227 240.10.3.4:7005@17005 master - 0 1732152534074 2 connected 12288-14335

jlizen · 2024-11-21T15:45:33Z

Going to also throw in a test against a more realistic application that has more complicated usage of the tokio runtime to see if anything rattles loose.

jlizen · 2024-12-16T18:56:23Z

I did some more testing in a real application. It is an axum server that handles user flows across a series of endpoints that span vending javascript to the client, vending inputs for a hashcash proof of work challenge, then various crypto operations to validate the solution and vend an encrypted token.

I injected the self-tracing task functionality in an outer tower middleware layer. It dumps a trace in case a request takes longer than 500ms. I then injected a 5% chance of a sleep inside a place where we make a remote call to AWS DynamoDB. This mimics a real performance bottleneck we encountered, due to overlarge table size, that was annoying to debug due to lack of specific metrics wrapping that remote call at the time.

I then simulated user realistic user flows that hit all endpoints in the browser, acquired tokens, etc. I didn't test at load but did send enough traffic to hit a bit of concurrency. I didn't see any signs of strange behavior with regard to the executor or otherwise. Seemed to behave as expected.

The stack trace was useful and pointed directly to the line of code that had the sleep added. See abbreviated output below:

warn 2024-12-16T18:45:51.813062959Z long running task: ╼ 
<snip> 
 └╼ test_app[5d607e4097ef04f4]::routing::apis::<test_app_backends[6d9dd0500a3162d0]::cp_datastore::backends::api_gateway_proxy::ApiGatewayProxy>::{closure#4} at /TestApp/test_app/src/routing.rs:269:22
    └╼ test_app[5d607e4097ef04f4]::handlers::verify::verify_handler::<test_app_backends[6d9dd0500a3162d0]::cp_datastore::backends::api_gateway_proxy::ApiGatewayProxy> at /TestApp/test_app/src/handlers/verify.rs:305:1
       └╼ test_app[5d607e4097ef04f4]::handlers::verify::verify_handler::<test_app_backends[6d9dd0500a3162d0]::cp_datastore::backends::api_gateway_proxy::ApiGatewayProxy>::{closure#0} at /TestApp/test_app/src/handlers/verify.rs:365:6
          └╼ test_app[5d607e4097ef04f4]::handlers::verify::verify_handler_inner::<test_app_backends[6d9dd0500a3162d0]::cp_datastore::backends::api_gateway_proxy::ApiGatewayProxy> at /TestApp/test_app/src/handlers/verify.rs:652:85
             └╼ test_app[5d607e4097ef04f4]::handlers::verify::check_and_insert_with_maybe_sleep at /TestApp/test_app/src/handlers/verify.rs:727:67
                └╼ <tokio[b5f4bbd0e64ee820]::time::sleep::Sleep as core[23aac720c248f827]::future::future::Future> at /TestApp/src/tokio/tokio/src/time/sleep.rs:448:22
                   └╼ <tokio[b5f4bbd0e64ee820]::time::sleep::Sleep> at /TestApp/src/tokio/tokio/src/time/sleep.rs:404:16

github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Nov 13, 2024

arielb1 force-pushed the self-trace branch 3 times, most recently from 35d68a7 to 1716248 Compare November 13, 2024 17:02

arielb1 force-pushed the self-trace branch 4 times, most recently from f26e43e to d93b7cf Compare November 14, 2024 18:19

make self-tracing public

5375c24

arielb1 force-pushed the self-trace branch from d93b7cf to 5375c24 Compare November 14, 2024 19:16

Merge branch 'master' into self-trace

c9c6efd

Darksonn marked this pull request as draft November 18, 2024 11:58

Darksonn added A-tokio Area: The main tokio crate M-taskdump --cfg tokio_taskdump labels Nov 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] make tracing a task public so self-tracing is possible #6972

[WIP] make tracing a task public so self-tracing is possible #6972

arielb1 commented Nov 13, 2024 •

edited

Loading

arielb1 commented Nov 13, 2024

arielb1 commented Nov 18, 2024

jlizen commented Nov 21, 2024 •

edited

Loading

jlizen commented Nov 21, 2024

jlizen commented Dec 16, 2024 •

edited

Loading

[WIP] make tracing a task public so self-tracing is possible #6972

Are you sure you want to change the base?

[WIP] make tracing a task public so self-tracing is possible #6972

Conversation

arielb1 commented Nov 13, 2024 • edited Loading

arielb1 commented Nov 13, 2024

arielb1 commented Nov 18, 2024

jlizen commented Nov 21, 2024 • edited Loading

jlizen commented Nov 21, 2024

jlizen commented Dec 16, 2024 • edited Loading

arielb1 commented Nov 13, 2024 •

edited

Loading

jlizen commented Nov 21, 2024 •

edited

Loading

jlizen commented Dec 16, 2024 •

edited

Loading